This chapter starts with basic principles of data programming, that is, coding that involves data. Data programming is a practice that works with data and evolves along with it. It allows the user to manage and process data more effectively. Programs are designed to be replicable by the user and by collaborators. A data program can be developed and updated iteratively and incrementally; in other words, it builds on accumulated work without repeating earlier steps. It also takes debugging, the process of identifying and fixing problems (bugs). In practice, debugging often means updating the program so that it works in different situations or with different inputs, including when the programmer returns to it in the future.
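The replicability principle can be illustrated with a minimal sketch. The function name `simulate_xy` and the seed value are hypothetical choices made for this illustration only; the point is that fixing the random seed lets anyone who reruns the script obtain the same data.

```r
# Wrap the data-generating step in a function so it can be rerun on demand.
# simulate_xy() and the seed 2024 are illustrative assumptions.
simulate_xy <- function(n = 50, seed = 2024) {
  set.seed(seed)  # fix the random number generator state
  data.frame(x = rnorm(n), y = rnorm(n))
}

run1 <- simulate_xy()
run2 <- simulate_xy()

# Both runs produce exactly the same data frame
identical(run1, run2)
```

Changing the `seed` argument gives a different but equally reproducible draw, which is one simple way to update a program incrementally without rewriting it.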
Social scientists Gentzkow and Shapiro (2014) list some principles for data programming.
A data program can provide or perform:
R basics
# Create variables composed of random numbers
x <- rnorm(50)
y <- rnorm(length(x))  # one random draw of y for each element of x
# Plot the points in the plane
plot(x, y)
Using R packages
# Plot better, using the ggplot2 package
## Prerequisite: install and load the ggplot2 package
## install.packages("ggplot2")
library(ggplot2)
qplot(x, y)
# Plot even better with ggplot2
x <- rnorm(50)
y <- rnorm(length(x))
# Pass the variables as a data frame so that ggplot() has an explicit data argument
ggplot(data.frame(x, y), aes(x, y)) + theme_bw() + geom_point(col = "blue")
In this section, we explore data about the 2016 Taiwan elections. The Taiwan Election and Democratization Study (TEDS), which began in 2001, is one of the longest-running and most comprehensive election studies. TEDS collects data through different survey modes, including face-to-face interviews, telephone interviews, and internet surveys. More detail on TEDS can be found at the National Chengchi University Election Study Center website at https://esc.nccu.edu.tw/main.php.
Taiwan Election and Democratization Study 2016 data
# Import the TEDS 2016 data in Stata format using the haven package
## install.packages("haven")
library(haven)
TEDS_2016 <- haven::read_stata("https://github.com/datageneration/home/blob/master/DataProgramming/data/TEDS_2016.dta?raw=true")
# Prepare to analyze the Party ID variable
# Assign labels to the values (1=KMT, 2=DPP, 3=NP, 4=PFP, 5=TSU, 6=NPP, 7="NA")
TEDS_2016$PartyID <- factor(TEDS_2016$PartyID,
                            labels = c("KMT", "DPP", "NP", "PFP", "TSU", "NPP", "NA"))
Take a look at the variable:
## [1] NA NA KMT NA NA DPP
## Levels: KMT DPP NP PFP TSU NPP NA
## [1] NA NA DPP NA NA NA
## Levels: KMT DPP NP PFP TSU NPP NA
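How `factor()` attaches labels to numeric codes can be seen in a small sketch. The vector `party_code` below is made-up toy data, not drawn from TEDS. Note that the label "NA" is an ordinary text label here, not R's missing value.

```r
# Toy numeric codes standing in for a survey variable (made up, not TEDS data)
party_code <- c(1, 2, 2, 7, 1, 6)

# Map codes 1 to 7 onto party labels; listing the levels explicitly keeps
# labels aligned even when some codes do not occur in the data
party <- factor(party_code,
                levels = 1:7,
                labels = c("KMT", "DPP", "NP", "PFP", "TSU", "NPP", "NA"))

head(party, 3)     # first three labelled values
levels(party)      # all seven labels, in code order
sum(is.na(party))  # the "NA" label is text, so no value counts as missing
```

Giving `levels` explicitly is a defensive choice: without it, `factor()` derives the levels from the values that happen to be present, and the labels would shift if a code were absent.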
Frequency table:
# Run a frequency table of the Party ID variable using the descr package
## install.packages("descr")
library(descr)
freq(TEDS_2016$PartyID)
## TEDS_2016$PartyID
##       Frequency  Percent
## KMT         388  22.9586
## DPP         591  34.9704
## NP            3   0.1775
## PFP          32   1.8935
## TSU           5   0.2959
## NPP          43   2.5444
## NA          628  37.1598
## Total      1690 100.0000
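The arithmetic behind a frequency table can be sketched in base R as well. The `responses` vector is made-up toy data and the object names are illustrative; this is not how `descr::freq()` is implemented, only a way to see where the counts and percentages come from.

```r
# Toy responses (made up, not TEDS data)
responses <- c("KMT", "DPP", "DPP", "NPP", "DPP")
party <- factor(responses, levels = c("KMT", "DPP", "NPP"))

counts <- table(party)                # number of cases per category
percent <- 100 * prop.table(counts)   # share of each category, in percent

freq_table <- data.frame(
  Frequency = as.vector(counts),
  Percent = round(as.vector(percent), 1),
  row.names = names(counts)
)
freq_table
```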
Get a better chart of the Party ID variable:
We can refine the chart further, for example by labeling the x and y axes and showing percentages instead of counts.
# after_stat() replaces the deprecated ..count.. notation (ggplot2 3.3.0 and later)
ggplot2::ggplot(TEDS_2016, aes(PartyID)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") +
  xlab("Taiwan Political Parties")
Adding colors, with another theme:
# Fill each bar by party; ggplot2 picks default colors
ggplot2::ggplot(TEDS_2016, aes(PartyID)) +
  geom_bar(aes(y = after_stat(count / sum(count)), fill = PartyID)) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") +
  xlab("Taiwan Political Parties") +
  theme_bw()
Hold on, colors are not right!
# Assign one color per party, in the order of the factor levels
ggplot2::ggplot(TEDS_2016, aes(PartyID)) +
  geom_bar(aes(y = after_stat(count / sum(count)), fill = PartyID)) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") +
  xlab("Taiwan Political Parties") +
  theme_bw() +
  scale_fill_manual(values = c("steelblue", "forestgreen", "khaki1", "orange", "goldenrod", "yellow", "grey"))
To make the chart more meaningful, we can use the tidyverse collection of packages to manage the data.
## install.packages("tidyverse")
library(tidyverse)
# Count respondents by party and convert the counts to shares of the sample
T2 <- TEDS_2016 %>%
  count(PartyID) %>%
  mutate(perc = n / nrow(TEDS_2016))
# Order the bars from the largest to the smallest share
ggplot2::ggplot(T2, aes(x = reorder(PartyID, -perc), y = perc, fill = PartyID)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") +
  xlab("Taiwan Political Parties") +
  theme_bw() +
  scale_fill_manual(values = c("steelblue", "forestgreen", "khaki1", "orange", "goldenrod", "yellow", "grey"))
In this section, we replicate Dr. Hans Rosling’s renowned animated chart depicting world development over time. For more detail, watch this BBC video. Data are drawn from Gapminder, a foundation established by the Rosling family.
# Install the packages needed for the animated chart (only required once)
my_packages <- c("tidyverse", "png", "gifski", "gapminder", "ggplot2", "gganimate", "RColorBrewer")
install.packages(my_packages, repos = "https://cran.rstudio.com")
library(gapminder)
library(ggplot2)
library(gganimate)
library(gifski)
library(png)
library(RColorBrewer)
data("gapminder")
# Basic scatter plot mapping (the animation frame is set by transition_time() below)
mapping <- aes(x = gdpPercap, y = lifeExp,
               size = pop, color = continent)
# Note: manual color choices.
ggplot(gapminder, mapping = mapping) +
  geom_point() +
  theme_linedraw() +
  scale_x_log10() +
  scale_color_manual(values = c("darkviolet", "darkblue", "firebrick1", "forestgreen", "deepskyblue1")) +
  labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'Life expectancy') +
  # Label only China and the United States
  geom_text(aes(label = ifelse(country %in% c("China", "United States"), as.character(country), "")),
            vjust = 0, nudge_y = 1, size = 6) +
  transition_time(year) +
  ease_aes('linear')
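To keep the animation as a file, one option is to render it and save it explicitly. The following is a minimal sketch that assumes the ggplot2, gganimate, gifski, and gapminder packages are installed. The object names `p`, `anim`, and `out_file`, the file name, and the frame settings are illustrative assumptions, and the plot is a simplified version of the chart above.

```r
library(ggplot2)
library(gganimate)
library(gapminder)

# Simplified version of the animated chart, stored in an object
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp,
                           size = pop, color = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +
  labs(title = "Year: {frame_time}", x = "GDP per capita", y = "Life expectancy") +
  transition_time(year) +
  ease_aes("linear")

# Render to a GIF; nframes and fps are arbitrary illustrative values
anim <- animate(p, nframes = 50, fps = 10, renderer = gifski_renderer())

# Save to a temporary location (hypothetical file name)
out_file <- file.path(tempdir(), "gapminder_animation.gif")
anim_save(out_file, animation = anim)
```

Fewer frames render faster while drafting; the frame count can be raised for a smoother final version.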